{
  "metadata": {
    "kernelspec": {
      "name": "python",
      "display_name": "Python (Pyodide)",
      "language": "python"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "python",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8"
    }
  },
  "nbformat_minor": 5,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "markdown",
      "source": "<a target=\"_blank\" href=\"https://nbviewer.org/github/jqnatividad/qsv/blob/master/contrib/notebooks/Whirlwindtour.ipynb\">\n    <img src=\"https://img.shields.io/badge/nbviewer-nbviewer.org-orange?logo=jupyter&label=View%20online\" alt=\"View online on nbviewer.org\">\n</a>\n<a target=\"_blank\" href=\"https://github.com/jqnatividad/qsv\">\n    <img src=\"https://img.shields.io/badge/github-jqnatividad%2Fqsv-white?logo=github&label=Source%20Code\" alt=\"View the source code on GitHub\">\n</a>\n",
      "metadata": {},
      "id": "f8466df2-007d-4d77-928c-6cca46abc97d"
    },
    {
      "cell_type": "markdown",
      "source": "### A whirlwind tour\n\n> ℹ️ **NOTE:** This tour is primarily targeted to Linux and macOS users. Though qsv works on Windows, the tour\nassumes basic knowledge of command-line piping and redirection, and uses other command-line tools (curl, tee, head, etc.)\nthat are not installed by default on Windows.\n\nLet's say you're playing with some data from the\n[Data Science Toolkit](https://github.com/petewarden/dstkdata), which contains\nseveral CSV files. Maybe you're interested in the population counts of each\ncity in the world. So grab the 124MB, 2.7M row CSV file and start examining it:\n",
      "metadata": {},
      "id": "ba3d7197-d56d-40ce-94bc-9e8366668808"
    },
    {
      "cell_type": "code",
      "source": "!qsv headers wcp.csv",
      "metadata": {},
      "execution_count": 1,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "1   Country\n\n2   City\n\n3   AccentCity\n\n4   Region\n\n5   Population\n\n6   Latitude\n\n7   Longitude\n"
        }
      ],
      "id": "0495a205-d3d3-44a3-b41c-f381aee6d3c2"
    },
    {
      "cell_type": "markdown",
      "source": "The next thing you might want to do is get an overview of the kind of data that\nappears in each column. The `stats` command will do this for you:",
      "metadata": {},
      "id": "c4b8013e-d107-43bc-88aa-443d56f7c70f"
    },
    {
      "cell_type": "markdown",
      "source": "!qsv stats wcp.csv | qsv table",
      "metadata": {},
      "id": "e0c644af-8de2-4b84-8aa0-ac550654c2f4"
    },
    {
      "cell_type": "markdown",
      "source": "Wow! That was fast! It took just 1.3 seconds to compile all that.[^1] One reason for qsv's speed\nis that ***it mainly works in \"streaming\" mode*** - computing statistics as it \"streams\"\nthe CSV file line by line. This also means it can gather statistics on arbitrarily large files,\nas it does not have to load the entire file into memory.[^2]\n\nBut can we get more summary statistics? What's the variance, the modes, the distribution (quartiles), \nand the cardinality of the data?  No problem. That's why `qsv stats` has an `--everything` option to \ncompute these more \"expensive\" stats. Expensive - as these extended statistics can only be computed at \nthe cost of loading the entire file into memory.\n",
      "metadata": {},
      "id": "72360a9b-fba1-478f-b0d4-5943df9ac1c3"
    },
    {
      "cell_type": "code",
      "source": "!qsv stats wcp.csv --everything | qsv table",
      "metadata": {},
      "execution_count": 8,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "field       type     sum            min           max         range     min_length  max_length  mean        stddev       variance          nullcount  sparsity  mad      lower_outer_fence  lower_inner_fence  q1       q2_median  q3       iqr      upper_inner_fence  upper_outer_fence  skewness  cardinality  mode         mode_count  mode_occurrences  antimode                                                                                                 antimode_count  antimode_occurrences\n\nCountry     String                  ad            zw                    2           2                                                      0          0                                                                                                                                              231          ru           1           176934            cc,nf,pn,tf,tk                                                                                           5               1\n\nCity        String                   al lusayli   ï¿½ykkvibaer            1           87                                                     0          0                                                                                                                                              2008182      san jose     1           313               *PREVIEW:  al lusayli, amiri-ye `olya, bab el ahmar, baqeri `olya, chahar tang-e bala, chefe muzaman...  1741957         1\n\nAccentCity  String                   Al Lusayli   ï¿½ï¿½ï¿½zler               1           87                                                     0          0                                                                                                                                              2025934      San Antonio  1           307               *PREVIEW:  Al Lusayli, Amiri-ye `Olya, Baqeri `Olya, Bï¿½b el Ahmar, Chahar Tang-e Bala, Chefe Muzam...    1762562         1\n\nRegion      String                  00            Z4                    0           2                                                      4          0                                                                                                                                              392          04           1           143900            *PREVIEW: H1,H6,H9,I1,I4,K5,K6,L1,L7,M8                                                                  17              1\n\nPopulation  Integer  2290536125     7             31480498    31480491  0           8           48730.6639  308414.0419  95119221210.8846  2652350    0.9826    8327     -69766.5           -33018             3730.5   10879      28229.5  24499    64978              101726.5           0.4164    28460                     1           2652350           *PREVIEW: 100,10001,10002,100023,10003,10005,10007,10008,10009,100105                                    18970           1\n\nLatitude    Float    76585211.1978  -54.9333333   82.483333   137.4167  1           12          28.3717     21.9384      481.2922          0          0         15.2667  -84.7706           -35.9076           12.9553  33.8667    45.5306  32.5753  94.3935            143.2564           -0.2839   255133       50.8         1           1128              *PREVIEW: -.020556,-.025833,-.035,-.035833,-.047222,-.051944,-.054722,-.068889,-.071111,-.100833         79002           1\n\nLongitude   Float    75976506.6643  -179.9833333  180.0       359.9833  1           14          28.1462     62.4729      3902.8581         0          0         30.9583  -199.3667          -98.4917           2.3833   26.8803    69.6333  67.25    170.5083           271.3833           0.2715    407568       23.1         1           590               *PREVIEW: -.185556,-.208056,-.246944,-.248333,-.263889,-.271667,-.276389,-.285,-.287222,-.288889         162640          1\n"
        }
      ],
      "id": "af448479-b6f5-4ba3-89b7-c3d124b7c5c5"
    },
    {
      "cell_type": "markdown",
      "source": "> ℹ️ **NOTE:** The `qsv table` command takes any CSV data and formats it into aligned columns\nusing [elastic tabstops](https://github.com/BurntSushi/tabwriter). You'll\nnotice that it even gets alignment right with respect to Unicode characters.\n",
      "metadata": {},
      "id": "c8623cc4-1426-4214-bd4e-8a6be9a68210"
    },
    {
      "cell_type": "markdown",
      "source": "So, this command took 3.22 seconds to run on my machine, but we can speed\nit up by creating an index and re-running the command:",
      "metadata": {},
      "id": "8f5a7d90-fbb0-4fe9-9866-223f81ed5fc2"
    },
    {
      "cell_type": "markdown",
      "source": "```\nqsv index wcp.csv\nqsv stats wcp.csv --everything | qsv table\n```",
      "metadata": {},
      "id": "ab1d059c-bb8b-42c7-aa04-f64f32e5f825"
    },
    {
      "cell_type": "markdown",
      "source": "Which cuts it down to 1.95 seconds - 1.65x faster! (And creating the 21.6mb index took 0.27 seconds. \nWhat about the first `stats` without `--everything`? From 1.3 seconds to 0.16 seconds with an index - 8.25x faster!)",
      "metadata": {},
      "id": "6c2c1a14-e61f-4951-9f1a-bad27840a017"
    },
    {
      "cell_type": "markdown",
      "source": "Notably, the same type of \"statistics\" command in another\n[CSV command line toolkit](https://csvkit.readthedocs.io/)\ntakes about 10 seconds to produce a *subset* of statistics on the same data set. [Visidata](https://visidata.org)\ntakes much longer - ~1.5 minutes to calculate a *subset* of these statistics with its Describe sheet. \nEven python [pandas'](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) \n`describe(include=\"all\"))` took 12 seconds to calculate a *subset* of qsv's \"streaming\" statistics.[^3]\n\nThis is another reason for qsv's speed. Creating an index accelerated statistics gathering as it enables \n***multithreading & fast I/O***.\n\n**For multithreading** - running `stats` with an index was 8.25x faster because it divided the file into \n16 equal chunks[^1] with ~170k records each, then running stats on each chunk in parallel across 16 \nlogical processors and merging the results in the end. It was \"only\" 8x, and not 16x faster as there is \nsome overhead involved in multithreading. \n\n**For fast I/O** - let's say you wanted to grab the last 10 records:",
      "metadata": {},
      "id": "eb7e899a-175f-4bef-882a-7a1dae15ef1a"
    },
    {
      "cell_type": "code",
      "source": "!qsv count --human-readable wcp.csv",
      "metadata": {},
      "execution_count": 12,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "2,699,354\n"
        }
      ],
      "id": "d8f63f5d-9107-4fbd-af81-614591318690"
    },
    {
      "cell_type": "code",
      "source": "!qsv slice wcp.csv --start -10 | qsv table",
      "metadata": {},
      "execution_count": 14,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Country  City               AccentCity         Region  Population  Latitude     Longitude\n\nzw       zibalonkwe         Zibalonkwe         06                  -19.8333333  27.4666667\n\nzw       zibunkululu        Zibunkululu        06                  -19.6666667  27.6166667\n\nzw       ziga               Ziga               06                  -19.2166667  27.4833333\n\nzw       zikamanas village  Zikamanas Village  00                  -18.2166667  27.95\n\nzw       zimbabwe           Zimbabwe           07                  -20.2666667  30.9166667\n\nzw       zimre park         Zimre Park         04                  -17.8661111  31.2136111\n\nzw       ziyakamanas        Ziyakamanas        00                  -18.2166667  27.95\n\nzw       zizalisari         Zizalisari         04                  -17.7588889  31.0105556\n\nzw       zuzumba            Zuzumba            06                  -20.0333333  27.9333333\n\nzw       zvishavane         Zvishavane         07      79876       -20.3333333  30.0333333\n"
        }
      ],
      "id": "ae7a1bbf-e129-4f28-9001-3c7643f41ebb"
    },
    {
      "cell_type": "markdown",
      "source": "`qsv count` took 0.006 seconds and `qsv slice`, 0.017 seconds! These commands are *instantaneous* \nwith an index because for `count` - the index already precomputed the record count, and with `slice`,\n*only the sliced portion* has to be parsed - because an index allowed us to jump directly to that \npart of the file. It didn't have to scan the entire file to get the last 10 records. For comparison,\nwithout an index, it took 0.25 (41x slower) and 0.66 (39x slower) seconds respectively.\n\n> ℹ️ **NOTE:** Creating/updating an index itself is extremely fast as well. If you want\nqsv to automatically create and update indices, set the environment var `QSV_AUTOINDEX`.\n\nOkay, okay! Let's switch gears and stop obsessing over how fast :rocket: qsv is... let's go back to exploring :mag_right:\nthe data set.\n\nHmmmm... the Population column has a lot of null values. How pervasive is that?\nFirst, let's take a look at 10 \"random\" rows with `sample`. We use the `--seed` parameter\nso we get a reproducible random sample. And then, let's display only the Country,\nAccentCity and Population columns with the `select` command.",
      "metadata": {},
      "id": "0f10b533-dc18-4cea-8936-3a58fff526fc"
    },
    {
      "cell_type": "code",
      "source": "!qsv sample --seed 42 10 wcp.csv | qsv select Country,AccentCity,Population | qsv table",
      "metadata": {},
      "execution_count": 15,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Country  AccentCity              Population\n\nco       Guamal                  5026\n\naf       Jalat Kalay             \n\nro       Caragheorghevici        \n\nmm       Padauk                  \n\ntr       Vakifbeloren            \n\nmx       Chupadero de Alcaparra  \n\nir       Gangar Kola             \n\ntg       Folo                    \n\npe       Huancas                 \n\npk       Dilawanwala             \n"
        }
      ],
      "id": "ad0597f0-22c8-4e2b-a5fa-f8cbe5392c91"
    },
    {
      "cell_type": "markdown",
      "source": "Whoops! The sample we got don't have population counts. It's quite pervasive. Exactly how many cities have empty (NULL) population counts?",
      "metadata": {},
      "id": "4c037b4c-0896-4de3-9237-6cd86a548087"
    },
    {
      "cell_type": "code",
      "source": "!qsv frequency wcp.csv --limit 3 | qsv table",
      "metadata": {},
      "execution_count": 16,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "field       value        count\n\nCountry     ru           176934\n\nCountry     us           141989\n\nCountry     cn           117508\n\nCity        san jose     313\n\nCity        san antonio  310\n\nCity        santa rosa   288\n\nAccentCity  San Antonio  307\n\nAccentCity  Santa Rosa   288\n\nAccentCity  Santa Cruz   268\n\nRegion      04           143900\n\nRegion      02           127736\n\nRegion      03           105455\n\nPopulation  (NULL)       2652350\n\nPopulation  2310         12\n\nPopulation  983          11\n\nLatitude    50.8         1128\n\nLatitude    50.95        1076\n\nLatitude    50.6         1043\n\nLongitude   23.1         590\n\nLongitude   23.2         586\n\nLongitude   23.05        575\n"
        }
      ],
      "id": "d3090c4d-18dd-4dbc-987e-bcf4c751df4b"
    },
    {
      "cell_type": "markdown",
      "source": "(The `qsv frequency` command builds a frequency table for each column in the\nCSV data. This one only took 1.8 seconds.)\n\nSo it seems that most cities do not have a population count associated with\nthem at all (2,652,350 to be exact). No matter — we can adjust our previous \ncommand so that it only shows rows with a population count:\n",
      "metadata": {},
      "id": "571d99d7-c867-4c74-bb6f-abd9fcf9f2a6"
    },
    {
      "cell_type": "code",
      "source": "!qsv search --select Population '[0-9]' wcp.csv | qsv sample --seed 42 10 | qsv select Country,AccentCity,Population | tee sample.csv | qsv table",
      "metadata": {},
      "execution_count": 19,
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": "'tee' is not recognized as an internal or external command,\n\noperable program or batch file.\n"
        }
      ],
      "id": "60f3bb63-aff3-4023-bdb1-c938016abc81"
    },
    {
      "cell_type": "markdown",
      "source": "> ℹ️ **NOTE:** The `tee` command reads from standard input and writes \nto both standard output and one or more files at the same time. We do this so \nwe can create the `sample.csv` file we need for the next step, and pipe the \nsame data to the `qsv table` command.<br/>Why create `sample.csv`? Even though qsv is blazing-fast, we're just doing an \ninitial investigation and a small 10-row sample is all we need to try out and\ncompose the different CLI commands needed to wrangle the data.\n",
      "metadata": {},
      "id": "81c813a8-1296-4df7-a3f4-2ddded7fd4fd"
    },
    {
      "cell_type": "markdown",
      "source": "Erk. Which country is `sv`? What continent? No clue, but [datawookie](https://github.com/datawookie) \nhas a CSV file called `country-continent.csv`.",
      "metadata": {},
      "id": "640ad7a5-ce97-44a4-aa19-2f43a8133fcf"
    },
    {
      "cell_type": "code",
      "source": "",
      "metadata": {},
      "execution_count": null,
      "outputs": [],
      "id": "ac0b8bb8-6249-4244-b759-6fe0568e2b4b"
    },
    {
      "cell_type": "markdown",
      "source": "\n```\n$ curl -L https://raw.githubusercontent.com/datawookie/data-diaspora/master/spatial/country-continent-codes.csv > country_continent.csv\n$ qsv headers country_continent.csv\n1 # https://datahub.io/JohnSnowLabs/country-and-continent-codes-list\n```",
      "metadata": {},
      "id": "aab56b5d-1a12-4086-8d69-fae8304edb66"
    },
    {
      "cell_type": "markdown",
      "source": "Huh!?! That's not what we we were expecting. But if you look at the country-continent.csv file, it starts with a comment with the # character.",
      "metadata": {},
      "id": "dbccfec9-385b-4494-9fa3-42d3ab940ef9"
    },
    {
      "cell_type": "code",
      "source": "!qsv count wcp.csv",
      "metadata": {},
      "execution_count": 1,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "2699354\n"
        }
      ],
      "id": "4550f7fd-acb7-4a9e-9ef5-e79855324c1d"
    },
    {
      "cell_type": "code",
      "source": "",
      "metadata": {},
      "execution_count": null,
      "outputs": [],
      "id": "9d723561-42bd-45fe-af2e-c17d17c41e8c"
    }
  ]
}